What is "jailbreaking" a large language model (LLM)?
“Jailbreaking” a large language model (LLM) means using an adversarially-designed text prompt to bypass the restrictions put on the model by its developer. The most capable LLMs that are publicly available (e.g., ChatGPT) have been shaped to not output certain types of text, including hate speech, incitation to violence, solutions to CAPTCHAs, and instructions for producing weapons.
Examples include the “grandma locket” image jailbreak, the “Do Anything Now” (DAN) jailbreak, and jailbreaks found by automatically generating adversarial prompts.
Overall, techniques like RLHF
A method for training an AI to give desirable outputs by using human feedback as a training signal.
Further reading:
- Lakera’s Gandalf is an interactive “game” where you can get a feel for jailbreaking by getting an LLM to reveal its “password”.